{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Enhance ACS Dictionary\n",
    "\n",
    "This notebook enriches `ACS_Dictionary.csv` (Variable, Label, Concept, Group)\n",
    "by adding columns matching the project's `ACSDictionary.csv` format.\n",
    "\n",
    "**Derived from label structure:**\n",
    "- `Label_0`–`Label_6` — split `Label` on `!!`\n",
    "- `Level` — hierarchy depth starting at 0 (`_E001` → 0; others → count of `!!` separators minus 1)\n",
    "- `Parent`, `Parent^2`, `Parent^3`, `Parent^4` — nearest prior variable at previous level within group\n",
    "- `Partition` — inferred from sibling labels (e.g. Male/Female → \"Sex\")\n",
    "- `SubGroupID` — variables sharing [Group, Level, Parent] form a subgroup; ID = `{Group}_{index}`\n",
    "\n",
    "**Inferred from concept/group:**\n",
    "- `WhatTheVariableMeasures` — e.g. \"Number of People\", \"Dollars\", \"Age\"\n",
    "- `Category` — e.g. \"Demographic\", \"Education\", \"Housing\"\n",
    "\n",
    "**Generated variables:**\n",
    "- `Cumulative` — for subgroups with >2 members measuring \"Number of X\",\n",
    "  running sums are created: `(V1)`, `(V1 + V2)`, `(V1 + V2 + V3)`, etc.\n",
    "\n",
    "**Original variables:** `Cumulative=False`, `Proportional=False`\n",
    "\n",
    "**Left blank:** `Denominator`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars as pl"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "INPUT_FILE = \"ACS_Dictionary.csv\"\n",
    "OUTPUT_FILE = \"ACSEnhanced_Dictionary.csv\"\n",
    "\n",
    "MAX_LABEL_LEVELS = 7  # Label_0 through Label_6\n",
    "\n",
    "COLUMN_ORDER = [\n",
    "    \"Variable\", \"Label\", \"Concept\", \"Group\",\n",
    "    \"WhatTheVariableMeasures\",\n",
    "    \"Label_0\", \"Label_1\", \"Label_2\", \"Label_3\", \"Label_4\", \"Label_5\", \"Label_6\",\n",
    "    \"Partition\",\n",
    "    \"Parent\", \"Parent^2\", \"Parent^3\", \"Parent^4\",\n",
    "    \"Level\",\n",
    "    \"Cumulative\", \"Proportional\", \"Denominator\",\n",
    "    \"SubGroupID\",\n",
    "    \"Category\",\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Load Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ACS_Dictionary.csv: 909 rows x 4 columns\n",
      "Columns: ['Variable', 'Label', 'Concept', 'Group']\n",
      "\n",
      "Sorted by Variable. First 5: ['B01002_E001', 'B01002_E002', 'B01002_E003', 'B01003_E001', 'B02001_E001']\n"
     ]
    }
   ],
   "source": [
    "dict_df = pl.read_csv(INPUT_FILE)\n",
    "print(f\"ACS_Dictionary.csv: {dict_df.shape[0]} rows x {dict_df.shape[1]} columns\")\n",
    "print(f\"Columns: {dict_df.columns}\")\n",
    "\n",
    "# Ensure sorted by Variable ID (required for parent derivation)\n",
    "dict_df = dict_df.sort(\"Variable\")\n",
    "print(f\"\\nSorted by Variable. First 5: {dict_df['Variable'].head(5).to_list()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Split Label into Label_0 through Label_6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Added Label_0 through Label_6\n",
      "shape: (5, 5)\n",
      "┌─────────────┬──────────┬───────────────┬─────────┬─────────┐\n",
      "│ Variable    ┆ Label_0  ┆ Label_1       ┆ Label_2 ┆ Label_3 │\n",
      "│ ---         ┆ ---      ┆ ---           ┆ ---     ┆ ---     │\n",
      "│ str         ┆ str      ┆ str           ┆ str     ┆ str     │\n",
      "╞═════════════╪══════════╪═══════════════╪═════════╪═════════╡\n",
      "│ B01002_E001 ┆ Estimate ┆ Median age -- ┆ Total:  ┆ null    │\n",
      "│ B01002_E002 ┆ Estimate ┆ Median age -- ┆ Male    ┆ null    │\n",
      "│ B01002_E003 ┆ Estimate ┆ Median age -- ┆ Female  ┆ null    │\n",
      "│ B01003_E001 ┆ Estimate ┆ Total         ┆ null    ┆ null    │\n",
      "│ B02001_E001 ┆ Estimate ┆ Total:        ┆ null    ┆ null    │\n",
      "└─────────────┴──────────┴───────────────┴─────────┴─────────┘\n"
     ]
    }
   ],
   "source": [
    "label_parts = dict_df[\"Label\"].fill_null(\"\").str.split(\"!!\")\n",
    "\n",
    "for i in range(MAX_LABEL_LEVELS):\n",
    "    dict_df = dict_df.with_columns(\n",
    "        label_parts.list.get(i, null_on_oob=True).alias(f\"Label_{i}\")\n",
    "    )\n",
    "\n",
    "print(\"Added Label_0 through Label_6\")\n",
    "print(dict_df.select([\"Variable\", \"Label_0\", \"Label_1\", \"Label_2\", \"Label_3\"]).head(5))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Level distribution:\n",
      "shape: (7, 2)\n",
      "┌───────┬─────┐\n",
      "│ Level ┆ len │\n",
      "│ ---   ┆ --- │\n",
      "│ f64   ┆ u32 │\n",
      "╞═══════╪═════╡\n",
      "│ 0.0   ┆ 56  │\n",
      "│ 1.0   ┆ 288 │\n",
      "│ 2.0   ┆ 406 │\n",
      "│ 3.0   ┆ 145 │\n",
      "│ 4.0   ┆ 8   │\n",
      "│ 5.0   ┆ 2   │\n",
      "│ 6.0   ┆ 4   │\n",
      "└───────┴─────┘\n",
      "\n",
      "Sample _E001 variables (should all be Level 0):\n",
      "shape: (5, 3)\n",
      "┌─────────────┬─────────────────────────────────┬───────┐\n",
      "│ Variable    ┆ Label                           ┆ Level │\n",
      "│ ---         ┆ ---                             ┆ ---   │\n",
      "│ str         ┆ str                             ┆ f64   │\n",
      "╞═════════════╪═════════════════════════════════╪═══════╡\n",
      "│ B01002_E001 ┆ Estimate!!Median age --!!Total… ┆ 0.0   │\n",
      "│ B01003_E001 ┆ Estimate!!Total                 ┆ 0.0   │\n",
      "│ B02001_E001 ┆ Estimate!!Total:                ┆ 0.0   │\n",
      "│ B05001_E001 ┆ Estimate!!Total:                ┆ 0.0   │\n",
      "│ B05002_E001 ┆ Estimate!!Total:                ┆ 0.0   │\n",
      "└─────────────┴─────────────────────────────────┴───────┘\n"
     ]
    }
   ],
   "source": [
    "# Count '!!' separators in label\n",
    "separator_count = dict_df[\"Label\"].fill_null(\"\").str.count_matches(\"!!\").cast(pl.Float64)\n",
    "\n",
    "dict_df = dict_df.with_columns(\n",
    "    pl.when(pl.col(\"Variable\").str.contains(\"_E001\"))\n",
    "    .then(pl.lit(0.0))\n",
    "    .when(separator_count == 0)\n",
    "    .then(pl.lit(0.0))\n",
    "    .otherwise(separator_count - 1)\n",
    "    .alias(\"Level\")\n",
    ")\n",
    "\n",
    "print(\"Level distribution:\")\n",
    "print(dict_df.group_by(\"Level\").len().sort(\"Level\"))\n",
    "print(\"\\nSample _E001 variables (should all be Level 0):\")\n",
    "print(\n",
    "    dict_df.filter(pl.col(\"Variable\").str.contains(\"_E001\"))\n",
    "    .select([\"Variable\", \"Label\", \"Level\"])\n",
    "    .head(5)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Compute Level\n",
    "\n",
    "- `_E001` variables are always Level 0 (group totals).\n",
    "- Other ACS variables: Level = count of `!!` separators minus 1.\n",
    "  This ensures parents (whose last label segment ends with `:`) are at a\n",
    "  lower level than their children (which add another `!!` segment).\n",
    "- Non-ACS variables (LandArea, PopulationDensity): Level 0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Parent assigned: 845/909\n",
      "\n",
      "Sample hierarchy (B05002):\n",
      "shape: (10, 4)\n",
      "┌─────────────┬───────┬─────────────┬─────────────┐\n",
      "│ Variable    ┆ Level ┆ Parent      ┆ Parent^2    │\n",
      "│ ---         ┆ ---   ┆ ---         ┆ ---         │\n",
      "│ str         ┆ f64   ┆ str         ┆ str         │\n",
      "╞═════════════╪═══════╪═════════════╪═════════════╡\n",
      "│ B05002_E001 ┆ 0.0   ┆ null        ┆ null        │\n",
      "│ B05002_E002 ┆ 1.0   ┆ B05002_E001 ┆ null        │\n",
      "│ B05002_E003 ┆ 2.0   ┆ B05002_E002 ┆ B05002_E001 │\n",
      "│ B05002_E004 ┆ 2.0   ┆ B05002_E002 ┆ B05002_E001 │\n",
      "│ B05002_E005 ┆ 3.0   ┆ B05002_E004 ┆ B05002_E002 │\n",
      "│ B05002_E006 ┆ 3.0   ┆ B05002_E004 ┆ B05002_E002 │\n",
      "│ B05002_E007 ┆ 3.0   ┆ B05002_E004 ┆ B05002_E002 │\n",
      "│ B05002_E008 ┆ 3.0   ┆ B05002_E004 ┆ B05002_E002 │\n",
      "│ B05002_E009 ┆ 2.0   ┆ B05002_E002 ┆ B05002_E001 │\n",
      "│ B05002_E010 ┆ 3.0   ┆ B05002_E009 ┆ B05002_E002 │\n",
      "└─────────────┴───────┴─────────────┴─────────────┘\n"
     ]
    }
   ],
   "source": [
    "variables = dict_df[\"Variable\"].to_list()\n",
    "groups = dict_df[\"Group\"].to_list()\n",
    "levels = dict_df[\"Level\"].to_list()\n",
    "\n",
    "# Derive Parent: look backward within same group for nearest Level - 1\n",
    "parents = [None] * len(variables)\n",
    "for i in range(len(variables)):\n",
    "    if levels[i] is None or levels[i] <= 0:\n",
    "        continue\n",
    "    target_level = levels[i] - 1\n",
    "    for j in range(i - 1, -1, -1):\n",
    "        if groups[j] != groups[i]:\n",
    "            break\n",
    "        if levels[j] == target_level:\n",
    "            parents[i] = variables[j]\n",
    "            break\n",
    "\n",
    "# Derive Parent^2, Parent^3, Parent^4 transitively\n",
    "parent_lookup = dict(zip(variables, parents))\n",
    "parent2 = [parent_lookup.get(p) if p else None for p in parents]\n",
    "parent3 = [parent_lookup.get(p) if p else None for p in parent2]\n",
    "parent4 = [parent_lookup.get(p) if p else None for p in parent3]\n",
    "\n",
    "dict_df = dict_df.with_columns([\n",
    "    pl.Series(\"Parent\", parents, dtype=pl.Utf8),\n",
    "    pl.Series(\"Parent^2\", parent2, dtype=pl.Utf8),\n",
    "    pl.Series(\"Parent^3\", parent3, dtype=pl.Utf8),\n",
    "    pl.Series(\"Parent^4\", parent4, dtype=pl.Utf8),\n",
    "])\n",
    "\n",
    "filled = dict_df.filter(pl.col(\"Parent\").is_not_null()).height\n",
    "print(f\"Parent assigned: {filled}/{dict_df.height}\")\n",
    "print(\"\\nSample hierarchy (B05002):\")\n",
    "print(\n",
    "    dict_df.filter(pl.col(\"Group\") == \"B05002\")\n",
    "    .select([\"Variable\", \"Level\", \"Parent\", \"Parent^2\"])\n",
    "    .head(10)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Derive Parent Hierarchy\n",
    "\n",
    "For each variable at Level N (N > 0), the immediate parent is the nearest\n",
    "preceding variable **in the same group** at Level N−1. Grandparent and\n",
    "higher ancestors follow the same logic transitively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WhatTheVariableMeasures distribution:\n",
      "shape: (7, 2)\n",
      "┌─────────────────────────┬─────┐\n",
      "│ WhatTheVariableMeasures ┆ len │\n",
      "│ ---                     ┆ --- │\n",
      "│ str                     ┆ u32 │\n",
      "╞═════════════════════════╪═════╡\n",
      "│ Number of People        ┆ 789 │\n",
      "│ Dollars                 ┆ 50  │\n",
      "│ Number of Housing Units ┆ 48  │\n",
      "│ Number of Households    ┆ 15  │\n",
      "│ Age                     ┆ 3   │\n",
      "│ Hours                   ┆ 3   │\n",
      "│ Rooms                   ┆ 1   │\n",
      "└─────────────────────────┴─────┘\n"
     ]
    }
   ],
   "source": [
    "def infer_measure(concept: str, group: str) -> str:\n",
    "    \"\"\"Infer what a variable measures from its concept and group.\"\"\"\n",
    "    c = (concept or \"\").upper()\n",
    "\n",
    "    if \"MEDIAN AGE\" in c:\n",
    "        return \"Age\"\n",
    "    if \"ROOMS\" in c and \"MEDIAN\" in c:\n",
    "        return \"Rooms\"\n",
    "    if \"HOURS\" in c and \"MEAN\" in c:\n",
    "        return \"Hours\"\n",
    "    if (\"MEDIAN\" in c or \"MEAN\" in c or \"PER CAPITA\" in c) and any(\n",
    "        kw in c for kw in [\"INCOME\", \"EARNINGS\", \"RENT\", \"VALUE\"]\n",
    "    ):\n",
    "        return \"Dollars\"\n",
    "    if group.startswith(\"B25\"):\n",
    "        return \"Number of Housing Units\"\n",
    "    if group.startswith(\"B28\"):\n",
    "        return \"Number of Households\"\n",
    "    return \"Number of People\"\n",
    "\n",
    "\n",
    "concepts = dict_df[\"Concept\"].to_list()\n",
    "grps = dict_df[\"Group\"].to_list()\n",
    "measures = [infer_measure(c, g) for c, g in zip(concepts, grps)]\n",
    "\n",
    "dict_df = dict_df.with_columns(\n",
    "    pl.Series(\"WhatTheVariableMeasures\", measures, dtype=pl.Utf8)\n",
    ")\n",
    "\n",
    "print(\"WhatTheVariableMeasures distribution:\")\n",
    "print(dict_df.group_by(\"WhatTheVariableMeasures\").len().sort(\"len\", descending=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Infer WhatTheVariableMeasures"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "All variables mapped to a category.\n",
      "\n",
      "Category distribution:\n",
      "shape: (7, 2)\n",
      "┌─────────────────────────────┬─────┐\n",
      "│ Category                    ┆ len │\n",
      "│ ---                         ┆ --- │\n",
      "│ str                         ┆ u32 │\n",
      "╞═════════════════════════════╪═════╡\n",
      "│ Demographic                 ┆ 107 │\n",
      "│ Economic / Income           ┆ 120 │\n",
      "│ Education                   ┆ 135 │\n",
      "│ Employment / Labor          ┆ 123 │\n",
      "│ Housing                     ┆ 51  │\n",
      "│ Technology / Infrastructure ┆ 15  │\n",
      "│ Transportation / Commuting  ┆ 358 │\n",
      "└─────────────────────────────┴─────┘\n"
     ]
    }
   ],
   "source": [
    "def infer_category(concept: str, group: str) -> str | None:\n",
    "    \"\"\"Infer thematic category from concept keywords and group prefix.\"\"\"\n",
    "    c = (concept or \"\").upper()\n",
    "\n",
    "    # Keyword matching (order matters: most specific first)\n",
    "    if any(kw in c for kw in [\"COMPUTER\", \"INTERNET\", \"BROADBAND\"]):\n",
    "        return \"Technology / Infrastructure\"\n",
    "    if any(kw in c for kw in [\"TRANSPORTATION\", \"TRAVEL TIME\", \"DEPARTURE\", \"VEHICLE\", \"COMMUT\", \"PLACE OF WORK\"]):\n",
    "        return \"Transportation / Commuting\"\n",
    "    if any(kw in c for kw in [\"HEATING FUEL\", \"HOUSING COST\", \"ELECTRICITY\", \"GAS COST\", \"OCCUPANCY\", \"GROSS RENT\", \"ROOMS\", \"VALUE (DOLLARS)\"]):\n",
    "        return \"Housing\"\n",
    "    if any(kw in c for kw in [\"WORK STATUS\", \"HOURS WORKED\", \"OCCUPATION\", \"INDUSTRY\"]):\n",
    "        return \"Employment / Labor\"\n",
    "    if any(kw in c for kw in [\"EDUCATIONAL\", \"SCHOOL\", \"DEGREE\"]):\n",
    "        return \"Education\"\n",
    "    if any(kw in c for kw in [\"POVERTY\"]):\n",
    "        return \"Demographic\"\n",
    "    if any(kw in c for kw in [\"INCOME\", \"EARNINGS\"]):\n",
    "        return \"Economic / Income\"\n",
    "    if any(kw in c for kw in [\"AGE\", \"SEX\", \"RACE\", \"NATIVITY\", \"CITIZENSHIP\", \"POPULATION\", \"PLACE OF BIRTH\"]):\n",
    "        return \"Demographic\"\n",
    "\n",
    "    # Group prefix fallback\n",
    "    if group.startswith(\"B25\"):\n",
    "        return \"Housing\"\n",
    "    if group.startswith(\"B08\"):\n",
    "        return \"Transportation / Commuting\"\n",
    "    if group.startswith(\"B19\"):\n",
    "        return \"Economic / Income\"\n",
    "    if group.startswith((\"B15\", \"C15\")):\n",
    "        return \"Education\"\n",
    "    if group.startswith(\"B28\"):\n",
    "        return \"Technology / Infrastructure\"\n",
    "    if group.startswith((\"B01\", \"B02\", \"B05\")):\n",
    "        return \"Demographic\"\n",
    "    if group.startswith((\"B23\", \"C24\")):\n",
    "        return \"Employment / Labor\"\n",
    "    if group in (\"GEOINFO\", \"DERIVED\"):\n",
    "        return \"Demographic\"\n",
    "    return None\n",
    "\n",
    "\n",
    "categories = [infer_category(c, g) for c, g in zip(concepts, grps)]\n",
    "dict_df = dict_df.with_columns(\n",
    "    pl.Series(\"Category\", categories, dtype=pl.Utf8)\n",
    ")\n",
    "\n",
    "# Check for unmapped\n",
    "unmapped = dict_df.filter(pl.col(\"Category\").is_null())\n",
    "if unmapped.height > 0:\n",
    "    print(f\"WARNING: {unmapped.height} unmapped variables:\")\n",
    "    print(unmapped.select([\"Variable\", \"Concept\", \"Group\"]).unique(subset=[\"Group\"]))\n",
    "else:\n",
    "    print(\"All variables mapped to a category.\")\n",
    "\n",
    "print(\"\\nCategory distribution:\")\n",
    "print(dict_df.group_by(\"Category\").len().sort(\"Category\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Infer Category"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Infer Partition and Compute SubGroupID\n",
    "\n",
    "For each non-root variable (Level > 0), siblings are variables sharing the\n",
    "same parent. The partition is inferred from the set of siblings' last label\n",
    "segments (e.g. {Male, Female} → \"Sex\").\n",
    "\n",
    "Variables with the same [Group, Level, Parent] form a subgroup. SubGroupID\n",
    "is `{Group}_{index}` where index is 1-based within the group."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Partition inferred: 770/909\n",
      "\n",
      "Partition distribution:\n",
      "shape: (17, 2)\n",
      "┌─────────────────────────┬─────┐\n",
      "│ Partition               ┆ len │\n",
      "│ ---                     ┆ --- │\n",
      "│ str                     ┆ u32 │\n",
      "╞═════════════════════════╪═════╡\n",
      "│ Means of Transportation ┆ 158 │\n",
      "│ null                    ┆ 139 │\n",
      "│ Age                     ┆ 109 │\n",
      "│ Income                  ┆ 84  │\n",
      "│ Vehicles Available      ┆ 70  │\n",
      "│ …                       ┆ …   │\n",
      "│ Educational Attainment  ┆ 20  │\n",
      "│ Technology Access       ┆ 11  │\n",
      "│ Heating Fuel            ┆ 11  │\n",
      "│ Citizenship Status      ┆ 11  │\n",
      "│ Race                    ┆ 9   │\n",
      "└─────────────────────────┴─────┘\n",
      "\n",
      "SubGroupID: 227 unique subgroups across 909 variables\n",
      "\n",
      "Sample (B05002):\n",
      "shape: (10, 4)\n",
      "┌─────────────┬───────┬─────────────┬────────────┐\n",
      "│ Variable    ┆ Level ┆ Parent      ┆ SubGroupID │\n",
      "│ ---         ┆ ---   ┆ ---         ┆ ---        │\n",
      "│ str         ┆ f64   ┆ str         ┆ str        │\n",
      "╞═════════════╪═══════╪═════════════╪════════════╡\n",
      "│ B05002_E001 ┆ 0.0   ┆ null        ┆ B05002_1   │\n",
      "│ B05002_E002 ┆ 1.0   ┆ B05002_E001 ┆ B05002_2   │\n",
      "│ B05002_E003 ┆ 2.0   ┆ B05002_E002 ┆ B05002_3   │\n",
      "│ B05002_E004 ┆ 2.0   ┆ B05002_E002 ┆ B05002_3   │\n",
      "│ B05002_E005 ┆ 3.0   ┆ B05002_E004 ┆ B05002_4   │\n",
      "│ B05002_E006 ┆ 3.0   ┆ B05002_E004 ┆ B05002_4   │\n",
      "│ B05002_E007 ┆ 3.0   ┆ B05002_E004 ┆ B05002_4   │\n",
      "│ B05002_E008 ┆ 3.0   ┆ B05002_E004 ┆ B05002_4   │\n",
      "│ B05002_E009 ┆ 2.0   ┆ B05002_E002 ┆ B05002_3   │\n",
      "│ B05002_E010 ┆ 3.0   ┆ B05002_E009 ┆ B05002_5   │\n",
      "└─────────────┴───────┴─────────────┴────────────┘\n"
     ]
    }
   ],
   "source": [
    "from collections import defaultdict, OrderedDict\n",
    "\n",
    "# --- Infer Partition from sibling labels ---\n",
    "parents_list = dict_df[\"Parent\"].to_list()\n",
    "labels_list = dict_df[\"Label\"].fill_null(\"\").to_list()\n",
    "\n",
    "# Group variables by their parent to identify siblings\n",
    "sibling_groups = defaultdict(list)\n",
    "for i in range(len(variables)):\n",
    "    if levels[i] is None or levels[i] <= 0:\n",
    "        continue\n",
    "    parent_key = parents_list[i] if parents_list[i] else f\"_root_{groups[i]}\"\n",
    "    parts = labels_list[i].split(\"!!\")\n",
    "    last_seg = parts[-1] if parts else \"\"\n",
    "    sibling_groups[parent_key].append((i, last_seg))\n",
    "\n",
    "\n",
    "def infer_partition(sibling_segments: list[str]) -> str | None:\n",
    "    \"\"\"Infer partition dimension from the last label segments of sibling variables.\"\"\"\n",
    "    if len(sibling_segments) <= 1:\n",
    "        return None\n",
    "    segs = [s.rstrip(\":\").strip().upper() for s in sibling_segments]\n",
    "    raw = [s.rstrip(\":\").strip() for s in sibling_segments]\n",
    "\n",
    "    if {\"MALE\", \"FEMALE\"}.intersection(segs):\n",
    "        return \"Sex\"\n",
    "    if any(\"YEAR\" in s for s in segs):\n",
    "        return \"Age\"\n",
    "    race_kw = [\"WHITE\", \"BLACK\", \"ASIAN\", \"PACIFIC ISLANDER\",\n",
    "               \"AMERICAN INDIAN\", \"ALASKA NATIVE\",\n",
    "               \"TWO OR MORE RACES\", \"SOME OTHER RACE\"]\n",
    "    if any(any(kw in s for kw in race_kw) for s in segs):\n",
    "        return \"Race\"\n",
    "    if any(\"HISPANIC\" in s or \"LATINO\" in s for s in segs):\n",
    "        return \"Hispanic or Latino Origin\"\n",
    "    nat_kw = [\"NATIVE BORN\", \"FOREIGN BORN\", \"FOREIGN-BORN\",\n",
    "              \"BORN IN STATE OF RESIDENCE\", \"BORN IN OTHER STATE\",\n",
    "              \"NATIVE\", \"BORN IN\"]\n",
    "    if any(any(kw in s for kw in nat_kw) for s in segs):\n",
    "        return \"Nativity\"\n",
    "    if any(\"CITIZEN\" in s or \"NATURALIZED\" in s for s in segs):\n",
    "        return \"Citizenship Status\"\n",
    "    ed_kw = [\"GRADE\", \"DEGREE\", \"DIPLOMA\", \"BACHELOR\", \"MASTER\", \"DOCTORATE\",\n",
    "             \"ASSOCIATE\", \"HIGH SCHOOL\", \"SOME COLLEGE\", \"COLLEGE\",\n",
    "             \"KINDERGARTEN\", \"NURSERY\", \"ENROLLED\", \"NOT ENROLLED\",\n",
    "             \"LESS THAN 9TH\", \"9TH TO 12TH\", \"NO SCHOOLING\"]\n",
    "    if any(any(kw in s for kw in ed_kw) for s in segs):\n",
    "        return \"Educational Attainment\"\n",
    "    transport_kw = [\"CAR, TRUCK, OR VAN\", \"PUBLIC TRANSPORTATION\", \"BUS\",\n",
    "                    \"SUBWAY\", \"RAILROAD\", \"BICYCLE\", \"WALKED\", \"TAXICAB\",\n",
    "                    \"MOTORCYCLE\", \"FERRYBOAT\", \"STREETCAR\", \"TROLLEY\",\n",
    "                    \"WORKED FROM HOME\", \"DROVE ALONE\"]\n",
    "    if any(any(kw in s for kw in transport_kw) for s in segs):\n",
    "        return \"Means of Transportation\"\n",
    "    if any(\"OWNER\" in s or \"RENTER\" in s for s in segs):\n",
    "        return \"Tenure\"\n",
    "    if any(\"MINUTE\" in s for s in segs):\n",
    "        return \"Travel Time\"\n",
    "    if any(\"$\" in s for s in raw):\n",
    "        return \"Income\"\n",
    "    if any(\"POVERTY\" in s for s in segs):\n",
    "        return \"Poverty Status\"\n",
    "    emp_kw = [\"EMPLOYED\", \"UNEMPLOYED\", \"IN LABOR FORCE\",\n",
    "              \"NOT IN LABOR FORCE\", \"IN ARMED FORCES\", \"CIVILIAN\"]\n",
    "    if any(any(kw in s for kw in emp_kw) for s in segs):\n",
    "        return \"Employment Status\"\n",
    "    work_kw = [\"WORKED\", \"DID NOT WORK\", \"FULL-TIME\", \"PART-TIME\"]\n",
    "    if any(any(kw in s for kw in work_kw) for s in segs):\n",
    "        return \"Work Status\"\n",
    "    if any(\"VEHICLE\" in s for s in segs):\n",
    "        return \"Vehicles Available\"\n",
    "    if any(\"BEDROOM\" in s for s in segs):\n",
    "        return \"Number of Bedrooms\"\n",
    "    if any(\"ROOM\" in s for s in segs):\n",
    "        return \"Number of Rooms\"\n",
    "    if any(\"COMPUTER\" in s or \"INTERNET\" in s or \"BROADBAND\" in s for s in segs):\n",
    "        return \"Technology Access\"\n",
    "    fuel_kw = [\"UTILITY GAS\", \"BOTTLED\", \"ELECTRICITY\", \"FUEL OIL\",\n",
    "               \"COAL\", \"WOOD\", \"SOLAR\", \"NO FUEL\"]\n",
    "    if any(any(kw in s for kw in fuel_kw) for s in segs):\n",
    "        return \"Heating Fuel\"\n",
    "    if any(\"FAMILY\" in s or \"NONFAMILY\" in s or \"MARRIED\" in s for s in segs):\n",
    "        return \"Household Type\"\n",
    "    occ_kw = [\"MANAGEMENT\", \"SERVICE OCCUPATIONS\", \"SALES\", \"NATURAL RESOURCES\",\n",
    "              \"PRODUCTION\", \"MILITARY\"]\n",
    "    if any(any(kw in s for kw in occ_kw) for s in segs):\n",
    "        return \"Occupation\"\n",
    "    ind_kw = [\"AGRICULTURE\", \"CONSTRUCTION\", \"MANUFACTURING\", \"WHOLESALE\",\n",
    "              \"RETAIL\", \"INFORMATION\", \"FINANCE\", \"PROFESSIONAL\",\n",
    "              \"ARTS\", \"PUBLIC ADMINISTRATION\"]\n",
    "    if any(any(kw in s for kw in ind_kw) for s in segs):\n",
    "        return \"Industry\"\n",
    "    if any(\"UNIT\" in s for s in segs):\n",
    "        return \"Units in Structure\"\n",
    "    if any(\"WORKER\" in s for s in segs):\n",
    "        return \"Number of Workers\"\n",
    "    return None\n",
    "\n",
    "\n",
    "# Apply partition inference\n",
    "partitions = [None] * len(variables)\n",
    "for parent_key, siblings in sibling_groups.items():\n",
    "    segments = [seg for _, seg in siblings]\n",
    "    partition = infer_partition(segments)\n",
    "    for idx, _ in siblings:\n",
    "        partitions[idx] = partition\n",
    "\n",
    "dict_df = dict_df.with_columns(\n",
    "    pl.Series(\"Partition\", partitions, dtype=pl.Utf8)\n",
    ")\n",
    "\n",
    "filled_p = dict_df.filter(pl.col(\"Partition\").is_not_null()).height\n",
    "print(f\"Partition inferred: {filled_p}/{dict_df.height}\")\n",
    "print(\"\\nPartition distribution:\")\n",
    "print(dict_df.group_by(\"Partition\").len().sort(\"len\", descending=True))\n",
    "\n",
    "# --- Compute SubGroupID ---\n",
    "# Each unique (Group, Level, Parent) = one subgroup, numbered 1-based per Group\n",
    "group_subgroup_index = defaultdict(OrderedDict)\n",
    "subgroup_keys = []\n",
    "for i in range(len(variables)):\n",
    "    key = (groups[i], levels[i], parents[i])\n",
    "    subgroup_keys.append(key)\n",
    "    grp = groups[i]\n",
    "    if key not in group_subgroup_index[grp]:\n",
    "        group_subgroup_index[grp][key] = len(group_subgroup_index[grp]) + 1\n",
    "\n",
    "subgroup_ids = [\n",
    "    f\"{key[0]}_{group_subgroup_index[key[0]][key]}\" for key in subgroup_keys\n",
    "]\n",
    "\n",
    "dict_df = dict_df.with_columns(\n",
    "    pl.Series(\"SubGroupID\", subgroup_ids, dtype=pl.Utf8)\n",
    ")\n",
    "\n",
    "n_subgroups = len(set(subgroup_ids))\n",
    "print(f\"\\nSubGroupID: {n_subgroups} unique subgroups across {dict_df.height} variables\")\n",
    "print(\"\\nSample (B05002):\")\n",
    "print(\n",
    "    dict_df.filter(pl.col(\"Group\") == \"B05002\")\n",
    "    .select([\"Variable\", \"Level\", \"Parent\", \"SubGroupID\"])\n",
    "    .head(10)\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--\n",
    "## 9. Generate Cumulative Variables\n",
    "\n",
    "For subgroups with more than 2 members whose measure is \"Number of X\",\n",
    "create cumulative rows for all variables except the last in the subgroup.\n",
    "The cumulative variable name lists the summed variables:\n",
    "`(V1 + V2 + ... + Vn)`.\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # --- Generate Cumulative Variables ---\n",
    "#\n",
    "# # Group row indices by SubGroupID\n",
    "# subgroup_indices = defaultdict(list)\n",
    "# for i in range(dict_df.height):\n",
    "#     sg_id = dict_df[i, \"SubGroupID\"]\n",
    "#     subgroup_indices[sg_id].append(i)\n",
    "#\n",
    "# cumulative_rows = []\n",
    "# eligible_count = 0\n",
    "#\n",
    "# for sg_id, indices in subgroup_indices.items():\n",
    "#     n = len(indices)\n",
    "#     if n <= 2:\n",
    "#         continue\n",
    "#\n",
    "#     # Check measure starts with \"Number of\"\n",
    "#     measure = dict_df[indices[0], \"WhatTheVariableMeasures\"]\n",
    "#     if not (measure or \"\").startswith(\"Number of\"):\n",
    "#         continue\n",
    "#\n",
    "#     eligible_count += 1\n",
    "#\n",
    "#     # Sort by Variable ID\n",
    "#     sorted_idx = sorted(indices, key=lambda idx: dict_df[idx, \"Variable\"])\n",
    "#\n",
    "#     # Generate n-1 cumulative rows (k = 1 to n-1 variables included)\n",
    "#     for k in range(1, n):\n",
    "#         var_ids = [dict_df[sorted_idx[j], \"Variable\"] for j in range(k)]\n",
    "#         cum_variable = \"(\" + \" + \".join(var_ids) + \")\"\n",
    "#\n",
    "#         # Last label segment of the last included variable\n",
    "#         last_label = dict_df[sorted_idx[k - 1], \"Label\"] or \"\"\n",
    "#         last_segment = last_label.split(\"!!\")[-1]\n",
    "#         cum_label = f\"Cumulative up to: {last_segment}\"\n",
    "#\n",
    "#         # Inherit metadata from the first variable in sorted order\n",
    "#         first = dict_df.row(sorted_idx[0], named=True)\n",
    "#         cum_row = dict(first)\n",
    "#         cum_row[\"Variable\"] = cum_variable\n",
    "#         cum_row[\"Label\"] = cum_label\n",
    "#         cum_row[\"SubGroupID\"] = sg_id + \"_C\"\n",
    "#         cumulative_rows.append(cum_row)\n",
    "#\n",
    "# # Mark original rows\n",
    "# dict_df = dict_df.with_columns([\n",
    "#     pl.lit(\"False\").alias(\"Cumulative\"),\n",
    "#     pl.lit(\"False\").alias(\"Proportional\"),\n",
    "#     pl.lit(None).cast(pl.Utf8).alias(\"Denominator\"),\n",
    "# ])\n",
    "#\n",
    "# # Build cumulative DataFrame and concatenate\n",
    "# if cumulative_rows:\n",
    "#     for row in cumulative_rows:\n",
    "#         row[\"Cumulative\"] = \"True\"\n",
    "#         row[\"Proportional\"] = \"False\"\n",
    "#         row[\"Denominator\"] = None\n",
    "#\n",
    "#     cum_df = pl.DataFrame(\n",
    "#         cumulative_rows,\n",
    "#         schema={c: dict_df[c].dtype for c in dict_df.columns},\n",
    "#     )\n",
    "#     dict_df = pl.concat([dict_df, cum_df])\n",
    "#\n",
    "# print(f\"Eligible subgroups (>2 members, 'Number of' measure): {eligible_count}\")\n",
    "# print(f\"Original rows: {dict_df.filter(pl.col('Cumulative') == 'False').height}\")\n",
    "# print(f\"Cumulative rows generated: {dict_df.filter(pl.col('Cumulative') == 'True').height}\")\n",
    "# print(f\"Total rows: {dict_df.height}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--\n",
    "## 9b. Generate Proportional Variables\n",
    "\n",
    "For each variable (original and cumulative) measuring \"Number of X\",\n",
    "create proportional versions by dividing by each ancestor up to the\n",
    "level-0 root. `_P1` uses the immediate parent as denominator, `_P2` the\n",
    "grandparent, etc.\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # --- Generate Proportional Variables ---\n",
    "#\n",
    "# # Parent column -> P1 (1 level up), Parent^2 -> P2, etc.\n",
    "# ancestor_cols = [\n",
    "#     (\"Parent\", 1),\n",
    "#     (\"Parent^2\", 2),\n",
    "#     (\"Parent^3\", 3),\n",
    "#     (\"Parent^4\", 4),\n",
    "# ]\n",
    "#\n",
    "# proportional_rows = []\n",
    "#\n",
    "# for i in range(dict_df.height):\n",
    "#     measure = dict_df[i, \"WhatTheVariableMeasures\"]\n",
    "#     if not (measure or \"\").startswith(\"Number of\"):\n",
    "#         continue\n",
    "#\n",
    "#     source_var = dict_df[i, \"Variable\"]\n",
    "#     source_sg = dict_df[i, \"SubGroupID\"]\n",
    "#     source_label = dict_df[i, \"Label\"] or \"\"\n",
    "#     last_segment = source_label.split(\"!!\")[-1]\n",
    "#\n",
    "#     for ancestor_col, level_diff in ancestor_cols:\n",
    "#         ancestor_var = dict_df[i, ancestor_col]\n",
    "#         if ancestor_var is None:\n",
    "#             continue\n",
    "#\n",
    "#         prop_row = dict(dict_df.row(i, named=True))\n",
    "#         prop_row[\"Variable\"] = f\"{source_var}/{ancestor_var}\"\n",
    "#         prop_row[\"Label\"] = f\"Proportion of: {last_segment}\"\n",
    "#         prop_row[\"Proportional\"] = \"True\"\n",
    "#         prop_row[\"Denominator\"] = ancestor_var\n",
    "#         prop_row[\"SubGroupID\"] = f\"{source_sg}_P{level_diff}\"\n",
    "#\n",
    "#         proportional_rows.append(prop_row)\n",
    "#\n",
    "# if proportional_rows:\n",
    "#     prop_df = pl.DataFrame(\n",
    "#         proportional_rows,\n",
    "#         schema={c: dict_df[c].dtype for c in dict_df.columns},\n",
    "#     )\n",
    "#     dict_df = pl.concat([dict_df, prop_df])\n",
    "#\n",
    "# n_prop_orig = sum(1 for r in proportional_rows if r[\"Cumulative\"] == \"False\")\n",
    "# n_prop_cum = sum(1 for r in proportional_rows if r[\"Cumulative\"] == \"True\")\n",
    "# print(f\"Proportional rows generated: {len(proportional_rows)}\")\n",
    "# print(f\"  From original variables: {n_prop_orig}\")\n",
    "# print(f\"  From cumulative variables: {n_prop_cum}\")\n",
    "# print(f\"Total rows: {dict_df.height}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--\n",
    "## 10. Save Enhanced Dictionary\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Saved ACSEnhanced_Dictionary.csv: 909 rows x 20 columns\n"
     ]
    }
   ],
   "source": [
    "# Reorder columns and save\n",
    "# dict_df = dict_df.select(COLUMN_ORDER)\n",
    "\n",
    "dict_df.write_csv(OUTPUT_FILE)\n",
    "print(f\"Saved {OUTPUT_FILE}: {dict_df.shape[0]} rows x {dict_df.shape[1]} columns\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!--\n",
    "## 11. Summary\n",
    "-->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print(\"Enhanced Dictionary Summary\")\n",
    "# print(\"=\" * 50)\n",
    "# print(f\"Total variables: {dict_df.height}\")\n",
    "# n_orig = dict_df.filter(\n",
    "#     (pl.col(\"Cumulative\") == \"False\") & (pl.col(\"Proportional\") == \"False\")\n",
    "# ).height\n",
    "# n_cum = dict_df.filter(\n",
    "#     (pl.col(\"Cumulative\") == \"True\") & (pl.col(\"Proportional\") == \"False\")\n",
    "# ).height\n",
    "# n_prop = dict_df.filter(pl.col(\"Proportional\") == \"True\").height\n",
    "# print(f\"  Original: {n_orig}  |  Cumulative: {n_cum}  |  Proportional: {n_prop}\")\n",
    "# print(f\"Columns: {dict_df.shape[1]}\")\n",
    "#\n",
    "# print(\"\\nColumn fill rates:\")\n",
    "# for col in COLUMN_ORDER:\n",
    "#     s = dict_df[col]\n",
    "#     if s.dtype == pl.Utf8:\n",
    "#         filled = s.filter(s.is_not_null() & (s != \"\")).len()\n",
    "#     else:\n",
    "#         filled = s.filter(s.is_not_null()).len()\n",
    "#     pct = filled / dict_df.height * 100\n",
    "#     marker = \"\" if pct == 100 else \" *\" if pct == 0 else \"\"\n",
    "#     print(f\"  {col}: {filled}/{dict_df.height} ({pct:.0f}%){marker}\")\n",
    "#\n",
    "# print(\"\\nSample proportional rows:\")\n",
    "# print(\n",
    "#     dict_df.filter(pl.col(\"Proportional\") == \"True\")\n",
    "#     .select([\"Variable\", \"Label\", \"Denominator\", \"SubGroupID\"])\n",
    "#     .head(10)\n",
    "# )"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
